Reprodukujeme následující článek a grafy v něm: (https://www.nature.com/articles/533452a)[https://www.nature.com/articles/533452a]
Jak sloučily 9 kategorií oboru výzkumu do 6?
Můžeme použít data Baker abychom z nich ukázali, jestli nějaké regiony mají horší reprodukovatelnost?
Ahoj Františku,
posílám data z Nature, článek a jejich dotaznik + otázky ke zpracování z naseho výzkumu. Odkaz na data je i v tom článku (ozn. červene). V článku nejsou zpracované např. ty geograficke informace, obory (najdes je ale v jejich datech v excelu vzadu ). Co s tim dělat - navrhuju toto:
- První věc , že bychom srovnali vše, co je v Nature článku, a co máme i my (něco, co je v Nature, naopak chybí v našem, cili to se nas netyka).
- Srovnání ostatních dat z jejich excelu s nasimi daty - především se jedná o ty geografické udaje (cili “eastern effect”). Dali bychom tam zjistění z našich dat, že to je převažně Asie, a muzeme přidat z jejich dat, jak odpovídali jejich respondenti z jednotlivých regionů (tzn. např. jestli asijsti vedci vnimají nalehavost krize jinak nez US, Evropa atd.) Plus přidame nase zjisteni ohledne strategií, jak se tomu vědci brání.
- V našem dotazníku jsou také označeny další otázky ke zpracování (v sekci Objektivita) = další problémy, které způsobují neobjektivitu (nereprodukovatelnost). To tam bude jako dalsi zjisteni krome Eastern effect.
- Popis vzorku: muzi/zeny, ak. draha, kolik dělá základní výzkum atd. (označil jsem v našem dotazníku). Pár věcí z toho dáme do článku.
Jakmile uděláš ta základní srovnání a frekvence, tak já Baker napíšu, a zjistím u ní, co má smysl dělat a co ne, pokud by to mělo jít k nim (pokud bychom to neposílali k nim, tak například do PLOS ONE). Také od ní zjistím, jestli vůbec můžeme použit ta jejich další nepublikovaná data do našeho článku (tzn. např. ty udaje o původu vědců). Případně pak můžeme udělat pár hlubších analýz, co nám přijde zajímavé.
Je také otázka, které z našich respondentů tam zařadit, já navrhuju se rozhodnout někde mezi 12- 15 minutama, a ty kratší vyřadit (než začneš).
Doufám, že jsem to popsal srozumitelně. Kdyžtak napiš, co k tomu máš, nebo si zavoláme. Já tady ted 14 dní nebudu, protože budeme cestovat. Nicméně pak bych jí rád poslal ten dotaz. A pak se tomu budu věnovat po zbytek srpna a v září (vracíme se 15.8.). V říjnu bych měl jet konečně na to LSD UK.
Uzivej červec a srpenec! Petr
PS: tu kapitolou do Virginia jsem odeslal, stravil jsem ted cca týden děláním úprav. Doufám, že to finálně projde…
Data byla pipravená pomocí skriptu Jedlicka01.Rmd, z něj
si lze vybrat dva soubory, jeden by měl být hůře připravený, ale s úplně
všemi proměnnými, na které jsme se ptali (dataAll.RData) a
druhý by měl být selektovaný, ale lépe připravený
(dataProcessed.RData). Zkusím je nedřív porovnat.
load("../data_src/dataAll.RData")
complete = data
load("../data_src/dataProcessed.RData")
selected = data
ncol(complete) == ncol(selected)
## [1] FALSE
ncol(complete) ; ncol(selected)
## [1] 161
## [1] 69
Fooo-hooo! Tak z připravených dat vypadlo skoro 100 proměnných!!! Jasně, už si to vybavuju! Celé bloky na faktory neobjektivity šly pryč, protože to nebylo primární v dosavatních textech a zprávách. Tak asi začnu tím, že se kouknu, co v Nature dělali a jestli to v obou souborech.
tibble(
`Baker Graph` = c("Is there reproducibility crisis?",
"HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE?",
"Obory (Chemistry, Phycs & Engineering, Earth & Environment, Biology, Medicine, Other)",
"HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone's else)",
"WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?",
"HAVE YOU EVER TRIED TO PUBLISH A REPRODUCTION ATTEMPT? (Successful/Unsuccessful)",
"HAVE YOU ESTABLISHED PROCEDURES FOR REPRODUCIBILITY?",
"---",
"---"),
`Baker Var` = c("(20) Which of the following statement regarding a 'crisis or reproducibility' within the science community do you agree with?",
"(22) In your opinion, what proportion of published results in your field are reproducible? i.e. the results of a given study could be replicated exactly or reproduced in multiple similar experimental systems with variations of experimental settings such as materials and experimental model)",
"(91) Which of the following best describes your area of interest?",
"(79) Which, if any, of the following have you done?, (80) ...80",
"(52--65) Please use the scale below to indicate how much each of the following factors contributes to a failure to reproduce results:, ...53 -- ...65",
"(81--84) ...81, ...82, ...83, ...84 ",
"(40) Have you and/or your lab group established any procedures to ensure reproducibility in your work?, (42) When did you and/or your lab group establish these procedures?",
"(85) Has anyone ever told you that they could not reproduce results from one of your own experiments?",
"(109) In which continent do you live?, (110--115) Which country in ..."),
`My Var: complete` = c("krize_rep",
"rep_podil",
"spec_hlavni",
"nerep_jaMuj, nerep_jaCizi",
"fakt_podvod -- fakt_smula",
"---",
"---",
"nerep_oniMuj",
"---"),
`My Var: selected` = c("krize_rep",
"---",
"spec_hlavni",
"---",
"---",
"---",
"---",
"---",
"---")
) %>% kable()
| Baker Graph | Baker Var | My Var: complete | My Var: selected |
|---|---|---|---|
| Is there reproducibility crisis? | (20) Which of the following statement regarding a ‘crisis or reproducibility’ within the science community do you agree with? | krize_rep | krize_rep |
| HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE? | (22) In your opinion, what proportion of published results in your field are reproducible? i.e. the results of a given study could be replicated exactly or reproduced in multiple similar experimental systems with variations of experimental settings such as materials and experimental model) | rep_podil | — |
| Obory (Chemistry, Phycs & Engineering, Earth & Environment, Biology, Medicine, Other) | (91) Which of the following best describes your area of interest? | spec_hlavni | spec_hlavni |
| HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone’s else) | (79) Which, if any, of the following have you done?, (80) …80 | nerep_jaMuj, nerep_jaCizi | — |
| WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH? | (52–65) Please use the scale below to indicate how much each of the following factors contributes to a failure to reproduce results:, …53 – …65 | fakt_podvod – fakt_smula | — |
| HAVE YOU EVER TRIED TO PUBLISH A REPRODUCTION ATTEMPT? (Successful/Unsuccessful) | (81–84) …81, …82, …83, …84 | — | — |
| HAVE YOU ESTABLISHED PROCEDURES FOR REPRODUCIBILITY? | (40) Have you and/or your lab group established any procedures to ensure reproducibility in your work?, (42) When did you and/or your lab group establish these procedures? | — | — |
| — | (85) Has anyone ever told you that they could not reproduce results from one of your own experiments? | nerep_oniMuj | — |
| — | (109) In which continent do you live?, (110–115) Which country in … | — | — |
Teď jsme si udělali jasno ohledně možné komparace dat z obou výzkumů.
Je jasné, že je třeba vzít dataAll.RData, neboť v těch
selektovaných datech nic pořádně na srovnání s Baker není. Teď si tedy
podle tabulky výše upravím objekt complete a z něj vyberu
vše, co je komparovatelné.
my = filter(complete, cas_sec >= 600, zakl_vyzkum == "Ano") %>%
select(krize_rep, rep_podil, spec_hlavni, nerep_jaMuj, nerep_oniMuj,
nerep_jaCizi, fakt_podvod:fakt_smula,
starts_with("reg_"), starts_with("nerep_"), starts_with("desp"), starts_with("reakce_"),
frek_nerepre:neobj_salam, gender:kar_kategorie) %>%
mutate(
Author = "Jedlička",
across(
c(krize_rep, spec_hlavni:fakt_smula, despekt, nerep_problem:nerep_narust),
~recode(
.x,
Nevím = "I don't know", Ano = "Yes", Ne = "No", `Nepamatuji se` = "I can't remember" ,
`Nedělám experimenty` = "I don't do experiments",
`Ano, významná krize` = "Yes, significant crisis",
`Ano, nevýznamná krize` = "Yes, slight crisis",
`Ne, žádná krize není` = "No crisis",
# `Astronomy and planetary science` = "Astronomy and Planetary Science",
`Mathematics` = "Other specialization",
Vždy = "Always", Nikdy = "Never",
Zřídka = "Rarely", Někdy = "Sometimes",
`Velmi často` = "Very often",
`Rozhodně souhlasím` = "Strongly agree", `Spíše souhlasím` = "Agree",
`Ani souhlasím/ani nesouhlasím` = "Neither agree nor disagree",
`Spíše nesouhlasím` = "Disagree", `Rozhodně nesouhlasím` = "Strongly disagree"
)
),
across(
starts_with("neobj_"),
~recode(.x, Ano = "100", Ne = "0") %>% parse_number()
),
across(
starts_with("frek_"),
~parse_number(.x)
)
)
## Warning: There were 16 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! 71 parsing failures.
## row col expected actual
## 60 -- a number Nevím
## 62 -- a number Nevím
## 84 -- a number Nevím
## 119 -- a number Nevím
## 127 -- a number Nevím
## ... ... ........ ......
## See problems(...) for more details.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 15 remaining warnings.
Aha… Tak když jsme dali limit 10 minut, tak jsme přišli o 20 lidí, vzorek klesne na 1001, když to dáme na 12 minut, tak to klesne až na 967, to je dalších 33 lidí, tedy přijdeme o 53 lidí, tedy 5% vzorku… No, já to teď nechám na těch 12 minutách, ale stačí říct, že by se těch dalších 33 lidí hodilo a já to celé sjedu s limitem 600 sekund. Ápropós, když limit snížím na 15 minut, tak vzorem klesne až na 878 respondentů.
tibble(`Časový limit (minuty)` = c(0, 10, 12, 15), N = c(1021, 1001, 967, 878)) %>% kable()
| Časový limit (minuty) | N |
|---|---|
| 0 | 1021 |
| 10 | 1001 |
| 12 | 967 |
| 15 | 878 |
Teď musíme provést totéž s daty od Baker. Načteme je, vybereme z nich proměnné a nakonec je budeme rekódovat tak, aby data byla srovnatelná.
# Definice vlastní funkce pro lepší pejmenování
prejmenuj = function(tdf, pozice, jmena) {
if (length(jmena) < 1) stop(print('Musíte zadat nějaká jména!'))
if (length(pozice) < 1) stop(print('Musíte zadat nějaké pozice!'))
if (length(pozice) != length(jmena)) stop(print('délka vektoru nových jmen a délka vektoru pozic se musí shodovat!'))
for (i in 1:length(jmena)) {
names(tdf)[pozice[i]] = jmena[i]
}
tdf
}
Baker = read_xlsx("../data_src/dataNature.xlsx", skip = 1) %>%
select(c(20, 22, 91, 79:80, 52:65, 85, 89, 24, 25, 50, 51)) %>%
prejmenuj(1:25,
c("krize_rep", "rep_podil", "spec_hlavni", "nerep_jaMuj", "nerep_jaCizi",
'fakt_podvod', 'fakt_karier', 'fakt_dohled', 'fakt_recenz', 'fakt_selekc',
'fakt_replik', 'fakt_statis', 'fakt_odborn', 'fakt_data', 'fakt_dokume',
'fakt_metody', 'fakt_variab', 'fakt_design', 'fakt_smula',
"nerep_oniMuj", "educ",#, "continent", "cntr.1", "cntr.2", "cntr.3", "cntr.4", "cntr.5", "cntr.6"
"nerep_problem", "nerep_problemAll", "nerep_chyba", "nerep_neobj"
)) %>%
mutate(
Author = "Baker",
across(
c(krize_rep:nerep_oniMuj),
~recode(
.x,
# `I don't know` = `Nevím`, Yes = "Ano", No = "Ne", `I can't remember` = "Nepamatuji se",
`There is a significant crisis of reproducibility` = "Yes, significant crisis",
`There is a slight crisis of reproducibility` = "Yes, slight crisis",
`There is no crisis of reproducibility` = "No crisis",
`Astronomy and planetary science` = "Astronomy and Planetary Science",
`Other` = "Other specialization",
`Always contributes` = "Always", `Never contributes` = "Never",
`Rarely contributes` = "Rarely", `Sometimes contributes` = "Sometimes",
`Very often contributes` = "Very often"
)
)
)
## Warning: Expecting logical in CZ1141 / R1141C104: got 'any'
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `No` -> `No...9`
## • `` -> `...11`
## • `` -> `...12`
## • `No` -> `No...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...26`
## • `` -> `...30`
## • `` -> `...34`
## • `` -> `...35`
## • `` -> `...39`
## • `` -> `...40`
## • `` -> `...41`
## • `` -> `...42`
## • `` -> `...43`
## • `` -> `...44`
## • `` -> `...45`
## • `` -> `...46`
## • `` -> `...47`
## • `` -> `...48`
## • `` -> `...49`
## • `` -> `...66`
## • `` -> `...78`
## • `` -> `...85`
## • `` -> `...86`
## • `` -> `...87`
## • `` -> `...88`
## • `` -> `...89`
## • `` -> `...90`
## • `` -> `...91`
## • `` -> `...92`
## • `` -> `...93`
## • `` -> `...94`
## • `` -> `...95`
## • `` -> `...96`
## • `` -> `...97`
## • `` -> `...98`
## • `` -> `...99`
## • `` -> `...100`
## • `` -> `...101`
## • `` -> `...102`
## • `` -> `...103`
## • `` -> `...104`
## • `` -> `...105`
## • `` -> `...106`
## • `` -> `...107`
## • `` -> `...108`
## • `` -> `...109`
## • `` -> `...110`
## • `` -> `...111`
## • `` -> `...112`
## • `` -> `...113`
## • `` -> `...114`
## • `` -> `...115`
Tak proměnné jsou vybrané a základní příprava hotová, teď můžeme soubory spojit, aby je bylo možné porovnat, dočištění, které lze udělat společně, to uděláme společně teď, tedy hned po spojení:
df = add_rows(my, Baker) %>% # Takto spojíme soubory
mutate(
rep_podil = parse_number(rep_podil),
across(
c(krize_rep, spec_hlavni:fakt_smula, despekt, Author),
~factor(
.x,
levels =
c("Yes, significant crisis", "Yes, slight crisis", "No crisis", "Yes", "No",
"I can't remember", "I don't do experiments", "Astronomy and Planetary Science", "Biology", "Chemistry",
"Earth and Environmental Science", "Engineering", "Materials Science", # "Mathematics",
"Medicine", "Physics", "Other specialization", "Baker", "Jedlička",
"Always", "Very often", "Sometimes", "Rarely", "Never", "Strongly agree", "Agree",
"Neither agree nor disagree", "Disagree", "Strongly disagree", "I don't know"
)
)
)
)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `rep_podil = parse_number(rep_podil)`.
## Caused by warning:
## ! 82 parsing failures.
## row col expected actual
## 46 -- a number Nevím
## 49 -- a number Nevím
## 59 -- a number Nevím
## 79 -- a number Nevím
## 85 -- a number Nevím
## ... ... ........ ......
## See problems(...) for more details.
V téhle části jednak zreplikuju grafy Monyi Baker z Nature a rovnou je srovnám s našimi výsledky. Grafy vezmu popořadě.
dfs = df %>% select(krize_rep, Author) %>% filter(!is.na(krize_rep)) %>%
group_by(Author) %>% mutate(N = n()) %>%
group_by(Author, krize_rep, N) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'krize_rep'. You can override
## using the `.groups` argument.
ggplot(dfs, aes(y = fct_rev(krize_rep), x = f, fill = Author, label = paste(f, "%"))) +
geom_col(alpha = 0.75, na.rm = T, position = position_dodge2(rev = T)) +
geom_text(position = position_dodge2(width = 0.9, reverse = T)) +
labs(title = "Is there reproducibility crisis?", x = "%",
caption = paste0("N_Baker = ", dfs[1, "N"],
", N_Jedlička = ", dfs[8, "N"])) +
# scale_x_log10() +
# guides(fill = "none") +
theme_minimal()
Tak tady jsem si všiml, a není to chyba!, že v našem výzkumu chybí 120 pozorování, ti lidi prostě neodpověděli, jiné vysvětlení nemám. My tedy máme 901 platných odpovědí, 120 chybělo, Baker má těch 1576 (obecně myslím, že tam má jenom ty, co odpověděli na všechno, ale teď si tím nejsem jistý).
Jinak, překvapení se nekoná, graf z Nature sedí, čísla jsme reprodukovali.
Ještě jsem neudělal graf a už tuším problém :( Nikde jsem nenašel dokumentaci, jak Baker rekódovala specializace. V datech je 9 kategorií (nemá matematiku jako my). Abych se s tím nedrbal donekonečna, tak jsem to zatím uděl podle těch 9/10 kategorií, co jsou v datech, pak se když tak poradíme, jak to sloučit.
dfs = df %>% select(rep_podil, spec_hlavni, Author) %>% filter(!is.na(rep_podil)) %>%
group_by(Author, spec_hlavni) %>%
mutate(N = n()) %>%
group_by(Author, rep_podil, spec_hlavni, N) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1), rep_podil = factor(rep_podil))
## `summarise()` has grouped output by 'Author', 'rep_podil', 'spec_hlavni'. You
## can override using the `.groups` argument.
ggplot(dfs, aes(y = rep_podil, x = f, fill = Author, label = paste(f, "%"))) +
facet_grid(cols = vars(Author), rows = vars(spec_hlavni)) +
geom_col(alpha = 0.75) +
geom_text() +
labs(title = "HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE?",
# caption = paste0("Počty platných pozorování: N_Baker = ", dfs[1, "N"],
# ", N_Jedlička = ", dfs[8, "N"])
) +
# scale_x_log10() +
guides(fill = "none") +
theme_minimal()
Tak, souhlasím, že vizualizace dle Baker hrozně tříští informaci. Proto zkusíme reprezentovat jen průměry a intervaly spolehlivosti pro tyto průměry.
dfs = df %>% select(rep_podil, spec_hlavni, Author) %>% filter(!is.na(rep_podil)) %>%
group_by(Author, spec_hlavni) %>%
summarise(mean = mean(rep_podil, na.rm = T), sd = sd(rep_podil, na.rm = T), n = n() - 1,
se = 1.96 * sd / (n ^ 0.5), l = paste0(round(mean, 0))) %>%
ungroup() %>%
mutate(spec_hlavni = fct_reorder(.f = spec_hlavni, .x = (mean * (Author == "Baker"))))
## `summarise()` has grouped output by 'Author'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(y = spec_hlavni, col = spec_hlavni, x = mean,
xmin = mean - se, xmax = mean + se, label = l)) +
facet_grid(cols = vars(Author)) +
geom_pointrange() +
geom_errorbarh() +
guides(color = "none") +
geom_point(alpha = 0.75, size = 10) +
geom_text(col = "white", size = 5) +
labs(title = "HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE?",
# caption = paste0("Počty platných pozorování: N_Baker = ", dfs[1, "N"],
# ", N_Jedlička = ", dfs[8, "N"])
) +
# scale_x_log10() +
guides(fill = "none") +
theme_minimal()
dfs = df %>% select(spec_hlavni, Author) %>% filter(!is.na(spec_hlavni)) %>%
group_by(Author) %>%
mutate(N = n()) %>%
group_by(Author, spec_hlavni, N) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'spec_hlavni'. You can override
## using the `.groups` argument.
ggplot(dfs, aes(y = fct_rev(spec_hlavni), x = f, fill = Author, label = paste(f, "%"))) +
geom_col(alpha = 0.75, position = position_dodge2(reverse = T)) +
geom_text(position = position_dodge2(width = 0.9, reverse = T)) +
labs(title = "Percentges of fields in both samples",
caption = paste0("N_Baker = ", dfs[1, "N"],
", N_Jedlička = ", dfs[18, "N"])
) +
# scale_x_log10() +
# guides(fill = "none") +
theme_minimal()
dfs = df %>% select(starts_with("nerep_"), Author) %>%
filter(!is.na(nerep_jaMuj)) %>% filter(!is.na(nerep_jaCizi)) %>% filter(!is.na(nerep_oniMuj)) %>%
pivot_longer(cols = starts_with("nerep"), names_to = "Type", names_prefix = "nerep_") %>%
mutate(Type = recode(Type, jaCizi = "I failed to\nreproduce others", jaMuj = "I failed to\nreproduce mine",
oniMuj = "They failed to\nreproduce mine") %>% factor()) %>%
group_by(Author, Type) %>%
mutate(N = n()) %>%
group_by(Author, Type, value, N) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'Type', 'value'. You can override
## using the `.groups` argument.
ggplot(dfs, aes(y = fct_rev(value), x = f, fill = Type, label = paste(f, "%"))) +
facet_grid(cols = vars(Author)) +
geom_col(alpha = 0.75, position = position_dodge2(reverse = T)) +
geom_text(position = position_dodge2(width = 0.9, reverse = T)) +
labs(title = "HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone's else)",
caption = paste0("N_Baker = ", dfs[1, "N"],
", N_Jedlička = ", dfs[18, "N"])
) +
# scale_x_log10() +
# guides(fill = "none") +
theme_minimal()
dfs = df %>% select(starts_with("nerep_"), Author, spec_hlavni) %>%
filter(!is.na(nerep_jaMuj)) %>% filter(!is.na(nerep_jaCizi)) %>% filter(!is.na(nerep_oniMuj)) %>%
pivot_longer(cols = starts_with("nerep"), names_to = "Type", names_prefix = "nerep_") %>%
mutate(Type = recode(Type, jaCizi = "I failed to\nreproduce others", jaMuj = "I failed to\nreproduce mine",
oniMuj = "They failed to\nreproduce mine") %>% factor()) %>%
group_by(Author, Type, spec_hlavni) %>%
mutate(N = n()) %>%
group_by(Author, Type, value, N, spec_hlavni) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1)) %>%
filter(value == "Yes")
## `summarise()` has grouped output by 'Author', 'Type', 'value', 'N'. You can
## override using the `.groups` argument.
ggplot(dfs, aes(y = fct_rev(spec_hlavni), x = f, fill = Type, label = paste(f, "%"))) +
facet_grid(cols = vars(Author)) +
geom_col(alpha = 0.75, position = position_dodge2(reverse = T)) +
geom_text(position = position_dodge2(width = 0.9, reverse = T)) +
labs(title = "HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone's else)",
x = "%",
# caption = paste0("Počty platných pozorování: N_Baker = ", dfs[1, "N"],
# ", N_Jedlička = ", dfs[18, "N"])
) +
# scale_x_log10() +
# guides(fill = "none") +
theme_minimal()
dfs = df %>% select(starts_with("fakt_"), Author) %>%
# filter(!is.na(nerep_jaMuj)) %>% filter(!is.na(nerep_jaCizi)) %>% filter(!is.na(nerep_oniMuj)) %>%
pivot_longer(cols = starts_with("fakt_"), names_to = "Typ", names_prefix = "fakt_") %>%
filter(!is.na(value)) %>% mutate(Typ = factor(Typ)) %>% group_by(Author, Typ) %>%
mutate(N = n()) %>% filter(value != "I don't know") %>% mutate(N_bezNevim = n()) %>%
filter(value == "Always" | value == "Very often" | value == "Sometimes") %>%
mutate(
value = recode(value, Always = "Always/often", `Very often` = "Always/often"),
Typ =
recode(
Typ, selekc = "Selective reporting", karier = "Pressure to publish",
statis = "Low statistical power", replik = "Not replicated enough",
dohled = "Insuficient oversight", dokume = "Methods, code unavailable",
design = "Poor experimental design", data = "Raw data not available",
podvod = "Fraud", recenz = "Insufficient peer review",
odborn = "Mistakes or inadequate expertise\nin reproduction efforts",
metody = "Methods need particular\ntechnical expertise", variab = "Variability of reagents", smula = "Bad luck"
)) %>%
group_by(Author, Typ, value, N, N_bezNevim) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1),
X = if_else(value == "Always/often" & Author == "Baker", f, 0) %>% sum(),
f_bezNevim = round(n / N_bezNevim * 100, 1)
) %>% ungroup() %>% group_by(Author, Typ, N, N_bezNevim) %>%
mutate(X = sum(X)) %>% ungroup() %>%
mutate(Typ = fct_reorder(.f = Typ, .x = X)) %>% rename("Answer" = value)
## `summarise()` has grouped output by 'Author', 'Typ', 'value', 'N'. You can
## override using the `.groups` argument.
ggplot(dfs, aes(y = Typ, x = f, fill = Answer, label = paste0(f, "%"))) +
facet_grid(cols = vars(Author)) +
geom_col(alpha = 0.75, position = position_stack(reverse = T)) +
geom_text(position = position_stack(reverse = T, vjust = 0.5), size = 2) +
labs(title = "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?\n(including 'I don't know')",
x = "%") +
theme_minimal()
ggplot(dfs, aes(y = Typ, x = f_bezNevim, fill = Answer, label = paste0(f_bezNevim, "%"))) +
facet_grid(cols = vars(Author)) +
geom_col(alpha = 0.75, position = position_stack(reverse = T)) +
geom_text(position = position_stack(reverse = T, vjust = 0.5), size = 2) +
labs(title = "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?\n(without 'I don't know')",
x = "%") +
theme_minimal()
dfs = df %>% select(starts_with("fakt_"), Author) %>%
# filter(!is.na(nerep_jaMuj)) %>% filter(!is.na(nerep_jaCizi)) %>% filter(!is.na(nerep_oniMuj)) %>%
pivot_longer(cols = starts_with("fakt_"), names_to = "Typ", names_prefix = "fakt_") %>%
filter(!is.na(value)) %>% mutate(Typ = factor(Typ)) %>% group_by(Author, Typ) %>%
mutate(N = n()) %>% filter(value != "I don't know") %>% mutate(N_bezNevim = n()) %>%
filter(value == "Always" | value == "Very often" | value == "Sometimes") %>%
mutate(
Typ =
recode(
Typ, selekc = "Selective reporting", karier = "Pressure to publish",
statis = "Low statistical power", replik = "Not replicated enough",
dohled = "Insuficient oversight", dokume = "Methods, code unavailable",
design = "Poor experimental design", data = "Raw data not available",
podvod = "Fraud", recenz = "Insufficient peer review",
odborn = "Mistakes or inadequate expertise\nin reproduction efforts",
metody = "Methods need particular\ntechnical expertise", variab = "Variability of reagents", smula = "Bad luck"
)) %>%
group_by(Author, Typ, value, N, N_bezNevim) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1),
X = if_else((value == "Always" | value == "Very often") & Author == "Baker", f, 0) %>% sum(),
f_bezNevim = round(n / N_bezNevim * 100, 1)
) %>% ungroup() %>% group_by(Author, Typ, N, N_bezNevim) %>%
mutate(X = sum(X)) %>% ungroup() %>%
mutate(Typ = fct_reorder(.f = Typ, .x = X)) %>% rename("Answer" = value)
## `summarise()` has grouped output by 'Author', 'Typ', 'value', 'N'. You can
## override using the `.groups` argument.
ggplot(dfs, aes(y = Typ, x = f, fill = Answer, label = paste0(f, "%"))) +
facet_grid(cols = vars(Author)) +
geom_col(alpha = 0.75, position = position_stack(reverse = T)) +
geom_text(position = position_stack(reverse = T, vjust = 0.5), size = 2) +
labs(title = "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?\n(including 'I don't know')",
x = "%") +
theme_minimal()
ggplot(dfs, aes(y = Typ, x = f_bezNevim, fill = Answer, label = paste0(f_bezNevim, "%"))) +
facet_grid(cols = vars(Author)) +
geom_col(alpha = 0.75, position = position_stack(reverse = T)) +
geom_text(position = position_stack(reverse = T, vjust = 0.5), size = 2) +
labs(title = "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?\n(without 'I don't know')",
x = "%") +
theme_minimal()
Tady jsem nereprodukoval graf od Baker přesně. Dal jsem si sice záležet, abych kategorie faktorů uspořádal podle součtu odpovědí v kategoriích ‘Vždy’ a ‘Velmi často’, a tak se dalo ověřit, jestli to dělám dobře a seřadím kategorie stejně. Taky kvůli tomu, abys líp našel jednotlivé faktory v těch mých zkratkách. Ale hlavně proto, abychom se mohli podívat, jak má často ona kategorii ‘Vždy’ a jak ‘Velmi často’, tedy, aby bylo jasné, v jakém poměru se ta její sdružená kategorie skládá. A tady je to zajímavé!
Je jasně vidět, že u Baker jsou respondenti víc ultimátní, mnohem
častěji volí “Vždy”, než zvolili Češi v našem výzkumu! Češi se jedině
rozšoupnou u faktoru ‘karier’, což je Pressure to publish,
tam dají 12.5%, ale bacha, stejně nebo víc použijí ve výzkumu Baker
tuhle kategorii ‘Vždy’ zahraniční vědci u 7 faktorů ze 14! Ale ‘karier’
vybočuje u Čechů celkově – v součtu dvou nejintenzivnějších kategorií
faktor tlaku na publikování a kariéru označují dokonce častěji než
zahraniční u Baker, v Čechách je to jediný hojně uváděný typ důvodu,
ještě ‘selekc’, tj. Selektivní reportování se přehoupne
přes 50 %, ale jinak jsou všechny faktory pod 40 % a vždy méně než
Baker, ta má mimochodem pod 40 % jen 5 důvodů ze 14, my 12 ze 14.
Přemýšlím, čím to může být? Napadá mi, že (1) Češi jsou větší idealisti, nebo (2) tím, že dotazujeme ty nejlepší instituce, tak jde o ego-defense strategy, tedy nepřiznají ani sobě, jak často se ta špína děje, nebo (3) měla Baker ve vzorku i nějaká méně prestižní pracoviště, kde se s tím prostě častěji ti vědci setkali, že se tam fixlovalo, a nebo (4) Češi chápou jinak otázku než zahraniční vědci – zatím co Čech to chápe “Když se někde najde sfixlovaný výsledek, je za tím vždy faktor X” a protože to berou tak, že k fixlování vedou různé důvody, spíš sáhnou po ‘Velmi často’ nebo ‘Někdy’ jelikož jim přijde absurdní, že za všemi fixlováními by byla kariéra, podvod atd. Naproti tomu cizinci to pochopili “Když se dostane ke slovu faktor X, jak často způsobí fixlování?” Řekl bych, že Baker určitě má na mysli ten druhý mód: jak často X vede k nekalostem. Jak bychom ověřili, že to tak chápou i Češi, resp. vyloučili, že to nepochopili jako: když se objeví nekalost, může za to X.
dfs = df %>%
pivot_longer(cols = starts_with("reg_")) %>% select(Author:value) %>%
filter(!is.na(value)) %>%
mutate(
value =
recode(
value, Asie = "Asia", Evropa = "Europe", `Severní Amerika` = "North America",
`Ve všech regionech je podíl nereprodukovatelných studií stejný` =
"In all regions, the proportion of\nnon-reproducible studies is the same.",
`Nedokážu posoudit` = "I cannot tell"
)
) %>% mutate(N = n(), value = factor(value) %>% fct_infreq() %>% fct_rev()) %>%
group_by(value, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'value'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(y = value, x = f, fill = value, label = paste0(f, "%"))) +
geom_col() +
geom_text(position = position_stack(vjust = 0.8)) +
guides(fill = "none") +
labs(title = "In which region(s) do I think there is the largest share of\nnon-reproducible studies?",
y = "", x = "") +
theme_minimal()
# Co potřebuju?
# 1) rozsekat string na jednotlivá slova a reshapovat
# 2) překódovat víceslovné názvy na jednoslovné a opravit překlepy
# 3) udělat faktor podle frekvence
msx = str_split_fixed(df$nerep_zeme, pattern = boundary("word"), n = 164) %>% as.data.frame() %>%
rowid_to_column() %>%
pivot_longer(cols = V1:V164) %>% filter(value != "") %>%
mutate(
value =
recode(
value,
Afriky = "Afrika", Americe = "Amerika", amerika = "Amerika", USA = "Amerika",
Arabské = "Arábie", Arabi = "Arábie", arabske = "Arábie", Saudi = "Arábie", Blízký = "Arábie",
asie = "Asie", Asii = "Asie", asijských = "Asie",
Cina = "Čína", Ćína = "Čína", Čina = "Čína", čína = "Čína", ČÍNA = "Čína", Čínská = "Čína", china = "Čína",
China = "Čína", Číně = "Čína", čínská = "Čína", čínských = "Čína", čínským = "Čína", CN = "Čína",
Evropě = "Evropa",
India = "Indie", indie = "Indie", INDIE = "Indie",
Iran = "Írán", Irán = "Írán", Italie = "Itálie",
Pakiskan = "Pákistán", Pakistán = "Pákistán",
SSSR = "Rusko", ruských = "Rusko", russia = "Rusko",
Ale = "ale", Fabulované = "fab", JV = "jv", Kde = "kde", # Překódování slov, které nejsou země, aby zmizely.
Mimochodem = "mmchd", Momentálně = "mmnt", Nedokážu = "nedokážu", Nejde = "nejde", Nelze = "nejde",
Nejvíce = "nej", Neda = "ne", Nemám = "ne", Nemohu = "ne", Neřekla = "ne", Obávám = "obv", Obecně = "obc",
Odhadovala = "odh", Podíl = "pod", Prosím = "pls", Rozvíjející = "roz", Saudská = "saud",
Severní = "sev", Spíš = "spíš", Tam = "tam", To = "to", Toto = "to", V = "v", Velká = "vel",
Všude = "all", Východ = "vých", Bývalé = "býv", Bývalý = "býv"
)) %>%
filter(str_detect(value, pattern = '[:upper:]')) %>% filter(value != "Nevím") %>%
mutate(
value =
recode(
value, Čína = "China", Indie = "India", Amerika = "USA",
Rusko = "Russia", Írán = "Iran", Asie = "Asia", Pákistán = "Pakistan",
Arábie = "Arabic countries", Japonsko = "Japan", Afrika = "Africa", Turecko = "Turkey"),
U = unique(rowid) %>% length(),
value = as_factor(value) %>% fct_infreq() %>% fct_rev(),
N = n()) %>%
group_by(value, N, U) %>% summarise(n = n()) %>% filter(n > 2) %>%
mutate(f = round(n / nrow(my) * 100, 1))
## `summarise()` has grouped output by 'value', 'N'. You can override using the
## `.groups` argument.
# 'nrow(my)' is size of sample,
# 'U' is number of respondents giving at least one answer,
# 'N' is number of mentions
ggplot(msx, aes(y = value, x = f, label = paste0(f, "%"), fill = value)) +
geom_col(width = 0.8) +
geom_text(position = position_stack(vjust = 0.5)) +
guides(fill = "none") +
labs(title = "Which country or countries do you think have the highest\nproportion of non-reproducible research?",
y = "Country", x = "",
caption = paste0("Note: Respondents might mention more than one country.\nWe ask ", nrow(my),
" respondents to mention countries, from ", msx[1, "U"], " respondents we received ",
msx[1, "N"], " mentions."),
subtitle = paste0("(N=", nrow(my), ")")) +
theme_minimal()
dfs = df %>% filter(!is.na(despekt)) %>%
mutate(N = n()) %>% group_by(despekt, N) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'despekt'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = despekt, y = f, label = paste0(f, "%"), fill = despekt)) +
geom_col(width = 0.7) +
geom_text(position = position_stack(vjust = 0.5)) +
labs(title = "I have noticed a disdain from foreign scientists in my field for Czech scientists\nor for Czech science (because of lower standards of scientific work, etc.)?", x = "", y = "%") +
guides(fill = "none") +
theme_minimal()
Bude doplněno.
dfs = df %>% select(starts_with("desp_"), -desp_zeme) %>% rowid_to_column() %>%
pivot_longer(cols = starts_with("desp_"), names_prefix = "desp_") %>%
filter(!is.na(value)) %>%
mutate(
name =
recode(
name, nevim = "I cannot tell.", jazyk = "It was a linguistic issue,\nas I am not a native speaker.",
kvalita = "It was related to the quality\nof the scientific work.", vlastni = "Other"
) %>% factor() %>% fct_infreq() %>% fct_rev(),
U = unique(rowid) %>% length(), N = n()
) %>% group_by(U, N, name) %>% summarise(n = n()) %>%
mutate(f = round(n / U * 100, 1))
## `summarise()` has grouped output by 'U', 'N'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(y = name, x = f, fill = name, label = paste0(f, "%"))) +
geom_col() +
geom_text(position = position_stack(vjust = 0.8)) +
guides(fill = "none") +
labs(title = paste0("If I have noted this disrespect:\n(N=", dfs[1, "U"], ")"),
caption = paste0("Note: Respondents might chose up to three answers, ", dfs[1, "U"],
" respondents gave at least one answer."),
y = "", x = "%") +
theme_minimal()
dfs = df %>% select(starts_with("reakce_")) %>% rowid_to_column() %>%
pivot_longer(cols = starts_with("reakce_"), names_prefix = "reakce_") %>%
filter(!is.na(value)) %>%
mutate(
name =
recode(
name, nectu = "I stop reading studies\nfrom that region altogether.",
vyberu = "I only choose labs that I believe\nare producing high quality science.",
obezret = "I continue to read studies from this region,\nbut I am cautious about their conclusions.",
nic = "Not at all, because even studies from this\nregion can contain valuable results.",
stejne = "In my field, studies from all\nregions are of equal quality.",
nevim = "I cannot tell.", vlastni = "Other"
) %>% factor() %>% fct_infreq() %>% fct_relevel("Other", after = 7) %>%
fct_relevel("I cannot tell.", after = 7) %>% fct_rev(),
U = unique(rowid) %>% length(), N = n()
) %>% group_by(U, N, name) %>% summarise(n = n()) %>%
mutate(f = round(n / U * 100, 1))
## `summarise()` has grouped output by 'U', 'N'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(y = name, x = f, fill = name, label = paste0(f, "%"))) +
geom_col() +
geom_text(position = position_stack(vjust = 0.8)) +
guides(fill = "none") +
labs(title = paste0("How do I react when I find that scientific studies from\ncertain regions are more likely to have flaws or\nare non-reproducible?\n(N=", dfs[1, "U"], ")"),
caption = paste0("Note: Respondents might chose up to three answers, ", dfs[1, "U"],
" respondents gave at least one answer."),
y = "", x = "%") +
theme_minimal()
Uděláme si na to pěkný graf! Každé položce spočítáme dva údaje:
dfs = df %>% select(starts_with("frek_"), starts_with("neobj_")) %>%
pivot_longer(cols = everything()) %>% filter(!is.na(value)) %>%
group_by(name) %>% summarise(f = round(mean(value, na.rm = T), 1)) %>%
separate(name, sep = "_", into = c("dim", "item")) %>%
pivot_wider(id_cols = item, names_from = dim, values_from = f) %>%
mutate(
item =
recode(
item,
salam = "Using the 'salami method', where the results are published in multiple papers (possibly duplicated).",
popular = "Insertion of 'popular' terms (e.g. in texts or articles) for the sole purpose of improving publishability.",
positiv = "Publishing only positive or statistically relevant results.",
nerepre = "Unjustified generalization of results where data are interpreted in too broad a context (insufficient large sample or unrepresentative sample).",
hacking = "P-hacking and similar strategies (data-dredging, significance-chasing, harking).",
zbytecne = "Publication of unimportant studies whose scientific contribution is questionable.",
oversell = "Deliberate exaggeration of scientific results ('overselling').",
modmet = "Use of 'fashionable' methods in research without scientific justification (e.g. in grant applications).")
)
ggplot(dfs,
aes(x = frek, y = neobj, col = item,
label = str_wrap(paste0(item, ": X = ", frek, "%, Y = ", neobj, "%."), 30))) +
geom_label_repel(size = 3) +
geom_point(size = 5, alpha = 0.3) +
guides(col = "none") +
labs(title = "How often do I encounter the following phenomena in my field? VS.\nDo I think the following phenomena distort the objectivity of science?",
y = "Do I think the following phenomena distort the objectivity of science? (%)",
x = "How often do I encounter the following phenomena in my field? (%)") +
theme_minimal()
dfs = df %>% filter(!is.na(gender)) %>%
mutate(gender = recode(gender, Muž = "Male", Žena = "Female", Ostatní = "Other") %>%
factor() %>% fct_infreq(),
N = n()) %>% group_by(gender, N) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = gender, y = f, label = paste0(f, "%"), fill = gender)) +
geom_col(width = 0.7) +
geom_text(position = position_stack(vjust = 0.5)) +
labs(title = paste0("I am:\n(N=", dfs$N[1],")"), x = "Gender", y = "%") +
guides(fill = "none") +
theme_minimal()
dfs = df %>% filter(!is.na(vek)) %>%
mutate(vek = recode(vek, `65 a více` = "65+"),
N = n()) %>% group_by(vek, N) %>% summarise(n = n()) %>%
mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'vek'. You can override using the `.groups`
## argument.
ggplot(dfs, aes(x = vek, y = f, label = paste0(f, "%"), fill = vek)) +
geom_col(width = 0.7) +
geom_text(position = position_stack(vjust = 0.5)) +
labs(title = paste0("My age is:\n(N=", dfs$N[1],")"), x = "Gender", y = "%") +
guides(fill = "none") +
theme_minimal()
dfs = df %>% filter(!is.na(kar_praxe)) %>%
mutate(
kar_praxe =
recode(kar_praxe, `5–9 let` = "5–9 years", `10–19 let` = "10–19 years",
`20 a více let` = "20+ years", `Méně než 5 let` = "<5 years") %>%
factor(levels = c("<5 years", "5–9 years", "10–19 years", "20+ years")),
N = n()) %>%
group_by(kar_praxe, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'kar_praxe'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = kar_praxe, y = f, label = paste0(f, "%"), fill = kar_praxe)) +
geom_col(width = 0.7) +
geom_text(position = position_stack(vjust = 0.5)) +
labs(title = "How many years do I work in research:", x = "", y = "") +
guides(fill = "none") +
theme_minimal()
Má to cenu překládat? Zájímá to mimo ČR vůbec někoho?
df %>% filter(!is.na(kar_titul)) %>%
mutate(
kar_titul = fct_relevel(kar_titul, "Mgr./Ing.", after = 1),
N = n()) %>%
group_by(kar_titul, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1)) %>%
ggplot(aes(x = kar_titul, y = f, label = paste0(f, "%"), fill = kar_titul)) +
geom_col(width = 0.7) +
geom_text(position = position_stack(vjust = 0.5)) +
labs(title = "My the best academic rank is:", x = "", y = "") +
guides(fill = "none") +
theme_minimal()
## `summarise()` has grouped output by 'kar_titul'. You can override using the
## `.groups` argument.
Tady mi jenom napadá – nevyhodíme ty Bc. respondenty? Jsou to asi 4 lidi, nebo 5 a zvýší to snad důvěryhodnost výsledků. U magistrů, inženýrů bych hodně váhal, to je 130 lidí.
Foo hoo! Tak jinak! Baker má data od spousty lidí, co jsou “jen” Ph.D. studenti, takže Mgr./Ing. rozhodně nechat, Baker jich má taky hafo: PhD Students má 410 (26.02%), Post-doctoral Fellows má 314 (19.92%), dohromady tyto dvě skupiny tvoří 46 % vzorku! To je taky možná důvod těch rozdílů, že těchto “relativních juniorů” je v tom jejím výzkumu mnohem víc než v tom našem.
dfs = df %>% filter(!is.na(kar_citace)) %>%
mutate(
kar_citace =
recode(kar_citace, `Do 100` = "0–100", `Do 500` = "101–500",
`Do 1000` = "501–1000", `Do 5000` = "1001–5000", Více = "5000+") %>%
factor(levels = c("0–100", "101–500", "501–1000", "1001–5000", "5000+")),
N = n()) %>%
group_by(kar_citace, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'kar_citace'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = kar_citace, y = f, label = paste0(f, "%"), fill = kar_citace)) +
geom_col(width = 0.7) +
geom_text(position = position_stack(vjust = 0.5)) +
labs(title = "Number of citations:", x = "Citations", y = "") +
guides(fill = "none") +
theme_minimal()
dfs = df %>% filter(!is.na(kar_nejlepsi)) %>%
mutate(
kar_nejlepsi =
recode(kar_nejlepsi, Ano = "Yes", Ne = "No") %>%
factor(levels = c("Yes", "No")),
N = n()) %>%
group_by(kar_nejlepsi, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'kar_nejlepsi'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = kar_nejlepsi, y = f, label = paste0(f, "%"), fill = kar_nejlepsi)) +
geom_col(width = 0.7) +
geom_text(position = position_stack(vjust = 0.5)) +
labs(title = "I got published (as co-author, as well) in the best journal in the field: ", x = "", y = "") +
guides(fill = "none") +
theme_minimal()
Asi jen pro nás, proto to zatím nepřekládám, jestli se Ti to bude na něco hodit, dej vědět. Taky tu nechávám absolutní počty. Aby to bylo použitelné, budeme to muset sjednotit napříč institucemi.
df %>% filter(!is.na(kar_kategorie)) %>%
mutate(
kar_kategorie = fct_relevel(kar_kategorie, "Doktorand do třídy V2", after = 2) # Pozor! Pozice této kategorie zůstane v datech stejná!
) %>%
ggplot(aes(y = kar_kategorie , fill = kar_kategorie)) +
geom_bar(width = 0.7) +
labs(title = "", x = "Počty respondentů", y = "Kategorie") +
guides(fill = "none") +
theme_minimal()
Tak a tady si uděláme 3 verze jednoho grafu, neb Baker se ptá na 4 výroky ohledně neroprodukovatelnosti a z nich se my ptáme na 3 a ještě na další 3, takže slušný mr*ník :-)
dfs = df %>% select(Author, starts_with("nerep_"), -c(nerep_zeme, nerep_jaMuj:nerep_jaCizi)) %>%
pivot_longer(cols = starts_with("nerep_"), names_prefix = "nerep_") %>%
filter(!is.na(value)) %>%
mutate(
name =
recode(
name, problem = "I think that the failure to reproduce scientific studies is a major problem in my field.",
problemAll = "I think that the failure to reproduce scientific studies is a major problem for all fields.",
chyba = "I think that a failure to reproduce a result most often means that the original finding is wrong.",
valid = "Nereproducibility detracts the validity of the original finding.",
neobj = "I think that a failure to reproduce rarely detracts from the validity of the original finding.",
veda = "Also irreproducible results are the part of the science.",
narust = "The fraction of irreproducibile findings has increased during my carrier."
),
value = factor(value, levels = c("Strongly agree", "Agree", "Neither agree nor disagree", "Disagree",
"Strongly disagree", "I don't know"))
) %>% group_by(Author, name) %>%
mutate(N = n()) %>% group_by(Author, name, N, value) %>%
summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'name', 'N'. You can override
## using the `.groups` argument.
dfs %>%
ggplot(aes(y = str_wrap(name, 50), x = f, label = paste0(f, "%"), fill = value)) +
facet_grid(cols = vars(Author)) +
geom_col(position = position_stack(reverse = T), alpha = 0.8) +
geom_label_repel(position = position_stack(vjust = 0.5, reverse = T), alpha = 0.75) +
labs(y = "Sentences", x = "%", title = "Agreement with selected sentences") +
# guides(label = "none") +
theme_minimal() +
theme(legend.position = "bottom")
dfs %>%
filter(name == "I think that the failure to reproduce scientific studies is a major problem in my field." |
name == "I think that the failure to reproduce scientific studies is a major problem for all fields." |
name == "I think that a failure to reproduce a result most often means that the original finding is wrong." |
name == "I think that a failure to reproduce rarely detracts from the validity of the original finding.") %>%
ggplot(aes(y = str_wrap(name, 50), x = f, label = paste0(f, "%"), fill = value)) +
facet_grid(cols = vars(Author)) +
geom_col(position = position_stack(reverse = T), alpha = 0.8) +
geom_label_repel(position = position_stack(vjust = 0.5, reverse = T), alpha = 0.75) +
labs(y = "Sentences", x = "%", title = "Agreement with selected sentences") +
# guides(label = "none") +
theme_minimal() +
theme(legend.position = "bottom")
dfs %>%
filter(name == "I think that the failure to reproduce scientific studies is a major problem in my field." |
name == "I think that a failure to reproduce a result most often means that the original finding is wrong." |
name == "I think that a failure to reproduce rarely detracts from the validity of the original finding.") %>%
ggplot(aes(y = str_wrap(name, 50), x = f, label = paste0(f, "%"), fill = value)) +
facet_grid(cols = vars(Author)) +
geom_col(position = position_stack(reverse = T), alpha = 0.8) +
geom_label_repel(position = position_stack(vjust = 0.5, reverse = T), alpha = 0.75) +
labs(y = "Sentences", x = "%", title = "Agreement with selected sentences") +
# guides(label = "none") +
theme_minimal() +
theme(legend.position = "bottom")
dfsx = df %>% select(Author, starts_with("nerep_"), -c(nerep_zeme, nerep_jaMuj:nerep_jaCizi)) %>%
pivot_longer(cols = starts_with("nerep_"), names_prefix = "nerep_") %>%
filter(!is.na(value), value != "I don't know") %>%
mutate(
name =
recode(
name, problem = "I think that the failure to reproduce scientific studies is a major problem in my field.",
problemAll = "I think that the failure to reproduce scientific studies is a major problem for all fields.",
chyba = "I think that a failure to reproduce a result most often means that the original finding is wrong.",
valid = "Nereproducibility detracts the validity of the original finding.",
neobj = "I think that a failure to reproduce rarely detracts from the validity of the original finding.",
veda = "Also irreproducible results are the part of the science.",
narust = "The fraction of irreproducibile findings has increased during my carrier."
),
value = factor(value, levels = c("Strongly agree", "Agree", "Neither agree nor disagree", "Disagree",
"Strongly disagree", "I don't know"))
) %>% group_by(Author, name) %>%
mutate(N = n()) %>% group_by(Author, name, N, value) %>%
summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'name', 'N'. You can override
## using the `.groups` argument.
dfsx %>%
filter(name == "I think that the failure to reproduce scientific studies is a major problem in my field." |
name == "I think that a failure to reproduce a result most often means that the original finding is wrong." |
name == "I think that a failure to reproduce rarely detracts from the validity of the original finding.") %>%
ggplot(aes(y = str_wrap(name, 50), x = f, label = paste0(f, "%"), fill = value)) +
facet_grid(cols = vars(Author)) +
geom_col(position = position_stack(reverse = T), alpha = 0.8) +
geom_label_repel(position = position_stack(vjust = 0.5, reverse = T), alpha = 0.75) +
labs(y = "Sentences", x = "%", title = "Agreement with selected sentences") +
# guides(label = "none") +
theme_minimal() +
theme(legend.position = "bottom")